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Architecture Generation Method 

1 Problem Statement 



A processor is formed from a number of individual functional units. Each of those units is 
able to perform different types of computation or storage tasks. In superscalar or VLIW 
machines there are often duplicates of the most heavily utilised functional- units. This 
allows more parallel computations to be performed. The processor architect determines 
the replication of and connectivity between these units. This is usually done on the basis 
of generalized statistics about the likely mix of instructions found in a wide range of real 
world applications. This means that the processor is designed for a very general workload 
and the instruction unit mix might not be ideal for any particular application domain. 

If an application specific processor is being synthesized then far greater efficiency can be 
obtained by determining the number and mix of execution on the basis of the particular 
application domain it is to be used for. The mix must be determined automatically by 
analyzing the type of code that the processor is going to execute. It is not sufficient to just 
measure the frequency of usage of different operations and replicate units on that basis. 
There is no advantage to be gained unless there are situations in which a number of 
replicated functional units could be usefully used in parallel. The optimisation must 
perform detailed analysis of the data flows within the program (especially the most 
performance critical parts) and use that to determine the right number of, and connectivity 
for, the functional units. 

Since the number and type of functional units and their connectivity may vary, so may the 
instruction word representation. An efficient instruction word representation is required. A 
processor without a fixed architecture provides a particular challenge in this regard. The 
average code density that is achieved is very heavily dependent on the efficiency of the 
instruction word representation. Good code density is an important attribute for an 
embedded processor where the program needs to be permanently stored in expensive 
non-volatile semiconductor memory. 



The nature of existing processor architectures is that the Instruction Set Architecture (ISA) 
is fixed. This means that the types of operations supported and thus the presence of 
appropriate functional units is also fixed. The unit mix is determined very early in the 
design process upon the basis of a wide range of applications. The concept of functional 
unit selection and replication being integrated into software tools on the basis of 
application code is commercially novel. 

The most relevant academic prior art is that for the automatic synthesis of TTAs 
(Transport Triggered Architectures). In this approach the starting point is a fully connected 
TTA with a number of duplicated functional units. The least used connections and least 
used functional units are gradually removed to produce a more optimised architecture. 
The performance of the architecture versus its silicon area can be plotted. This is similar 
to the techniques employed by CriticalBlue. The CriticalBlue approach differs in that 
connections and resources are added rather than removed. The architecture starts within 
a minimum number of connections and functional units and new ones are added as 
required. The CriticalBlue approach has a tighter integration between the code generator 
and the architectural optimisation. It also allows evaluation of different instruction word 
widths. The prior art TTA approach uses a fixed instruction word representation and a 
fixed number of connection buses (with a variable number of connections to those buses). 
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1 



t 



# 



CriticalBlue 



Architecture Generation 



3 Summary of Contribution 



An initial candidate architecture is created with a minimal connectivity network and one 
copy of each functional unit. Software code is then generated for the candidate 
architecture. As trial code is being generated, information is collected in a table about 
resources or connections that could be used to improve the code generation. For 
instance, a count is kept of the number of occasions in which data needs to be 
transported between two functional units, and there is not a direct connection between. A 
count is also maintained of the number of times that all functional units of a particular type 
are busy with existing operations when another operation of that type needs to be 
performed. 

The counts produced during code generation are then weighted by two different factors. 
Firstly, code from the software functions that have been marked as being critical for 
overall performance are given a higher weighting. Instructions within the inner loops of 
such functions are given a still higher weighting. The weightings are used to bias the 
resource table. The resource table represents changes to the architecture that would 
provide the greatest performance benefit to the architecture. 

The table is then examined and a single new resource is added. This might be a duplicate 
functional unit or a new connection between functional units. The area overhead of the 
new resource is compared against its weight in comparison to other potential resource 
additions. A choice is made for a new resource' taking into account both the area it 
occupies and its potential benefit When a new resource has been added to the 
architecture the code generation process is repeated. The addition of the resource should 
improve the performance of the architecture and also reveal what further resources 
should be added. 

The instruction word representation is optimised at the same time as the resources within 
the processor. The width of particular operations increases as new connections are 
added and thus the number of bits to control the operand multiplexers grows. As code is 
being generated a frequency table is maintained of how often particular operations are 
utilised. Again this is weighted by performance critical functions and inner loops so that 
they have more influence on the architecture. The instruction word layout attempts to 
allocate different positions in the instruction word for the most frequently used functional 
units. Thus simultaneous operations on these functional units can be issued without 
interference. Less frequently used functional units have their control bits overlaid with 
other functional units in the instruction word. Operations cannot be issued to both 
simultaneously. Since such a clash happens less often it has a smaller impact on overall 
performance. The width of the instruction word is varied as part of the architectural 
optimisation process. The instruction word can be any byte width of 32 bits or wider. 

A graph of estimated performance versus area for every candidate architecture is 
produced. The user is able to pick any architecture that has been produced and fix that as 
the final architecture for the system. In this way the architecture is molded to the particular 
requirements of the application. 

4 Architectural Overview 

4.1 General Philosophy 

One of the key requirements of the architecture is to support scalable parallelism. The 
basic structure of the micro architecture and the operation of the design tools are all 
focused on that goal. 
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Extracting parallelism from highly numeric loop kernels is relatively straightforward. Such 
loops have regular computation and access patterns that are easy to analyse. The nature 
of the algorithms also tends' to lend itself welt to parallel computation. The architecture just 
needs to balance the availability of computational resources (such as adders, multipliers) 
and memory units to ensure the right degree of parallelism can be extracted. Such 
numeric kernels are common for DSPs. The loops tend to lack any complex control flow. 
Thus DSPs tend to be highly efficient at regular computation loops but are very poor at 
handling code with more complicated control flow. 

Other than in numeric computation loops, C and C++ code tends to be filled with 
complicated control flow structures. This is simply because most control code is filled with 
conditional statements and short loops. Most C++ code is also filled with references to 
main memory via pointers. The result is a code stream from which it is extremely difficult 
to extract useful amounts of parallelism. In average RISC code, approximately 30% of all 
instructions are memory references and a branch is encountered every 5 instructions. 

General purpose processors for PCs have to deal with this kind of code and extract 
parallelism from it in order to achieve competitive performance. The complexity of PC 
processors has mushroomed in recent years to try and deal with this issue. The control 
logic for a modern PC processor has literally millions of transistors dedicated to the task of 
extracting parallelism from the code being executed. The extra hardware needed to 
actually perform the operations in parallel is tiny in comparison with the logic required to 
find and control them. The main method utilised is to support dynamic out-of-order and 
speculative execution. This allows the processor to execute instructions in a different 
order from that specified by the program. It can also execute those instructions 
speculatively, before it knows for sure whether they should be executed at all. This allows 
parallelism to be extracted across branches. The difficult constraint is that the execution 
must always produce exactly the same answer as would result if the instructions were 
executed strictly one by one in the original order. 

The control and complexity overheads of dynamic out-of-order execution are far too high 
for a CriticalBlue processor. There is a significant cost overhead due to the area occupied 
by the control logic, not to mention the cost of designing it. Additionally, such logic is not 
amenable to the scalability requirements of the CriticalBlue architecture. 

A number of recent developments in the area of micro architecture have been focused on 
VLIW type architectures. There is a "back to basics" movement that seeks to place the 
burden of extracting parallelism on the compiler. The compiler is able to perform much 
greater analysis to seek parallelism in the application. It is also considerably simpler to 
develop than equivalent control logic. This is because the control logic must find the 
parallelism as the program is running so must itself be highly pipelined and suffers from 
the physical constraints of circuit design. The compiler performs all of its work up front in 
software with the luxury of much longer analysis time. For most classes of static 
parallelism, compiler analysis is very effective. 

Unfortunately, software analysis is poor at extracting parallelism that can only be 
determined dynamically. Examples of these are branches and potentially aliased memory 
accesses. A compiler can know the probability that a particular branch will be taken from 
profiling information, but it cannot know for sure whether it will be taken on any particular 
instance. A compiler can also tell from profiling that two memory accesses never seem to 
access the same memory location, but it cannot prove that will always be the case. 
Consequently it is not able to move a store operation over a potentially aliased load 
operation as that might affect the results the program would generated. This restricts the 
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amount of parallelism that can be extracted statically in comparison to that available 
dynamically. 

CriticalBlue employs a unique combination of static and dynamic parallelism extraction. 
This gives the architecture access to high degrees of parallelism without the overhead of 
complex hardware control structures. The software tools perform all the analysis, moving 
the complexity burden away from the hardware. The CriticalBlue architecture itself is fully 
static in . its execution model. It executes instructions in exactly the order specified by the 
tools. These instructions may be out of order with respect to the original program, if the 
tools are able to prove that the re-ordering does not affect the program result. This 
reordering is called instruction scheduling and is an important optimisation pass for most 
architectures, and especially for CriticalBlue. 

CriticalBlue has a revolutionary execution model that also allows it to perform out-of-order 
operations that cannot be proved as safe at code generation time. In general it only 
perform these optimisations if it knows that they will usually be safe at execution time. The 
hardware is able to detect if the assumptions are wrong and arrange a re-execution of the 
code that is guaranteed to produce the correct answer in all circumstances. The hardware 
overhead for this "hazard" detection and re-execution is very small. 

The diagram below illustrates the power that this style of execution gives the CriticalBlue 
architecture: 
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An example segment of code is shown on the left hand side. This is composed of 
individual basic blocks. A basic block is a segment of code that is delineated by a branch 
operation. If execution enters a basic block then all instructions within it will be executed. 
At the end of the basic block there is a branch instruction that causes execution to 
continue with one or two different possible successor blocks. The condition upon which 
the branch is performed is generally calculated in the code of the basic block so it is not 
possible to know before entering the block which route will be taken. Each execution of 
the code will produce a particular path of execution through the basic blocks. Certain 
paths may be considerably more likely than others but any route may be taken. 
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The middle section of the diagram shows the execution in a conventional processor that 
does not support any kind of out of order execution. This is typical of RISC processor 
cores. Each basic block as to be executed one after the other, as the branches are 
resolved. 



The right hand section shows the execution model for CriticalBlue. It is able to execute 
code from a number of different basic blocks in parallel. It does this to increase the 
amount of parallelism and efficiency of the architecture. It might know that one basic block 
is very likely to follow another. It can pull instructions forward from the second block to 
execute in parallel with the instructions of the first. This allows calculations to be started 
earlier (and thus finish earlier) and to more effectively balance the resource utilisation of 
the processor. The CriticalBlue scheduling algorithm does this while taking account of the 
probability that the execution of a particular instruction will be useful. 

Executing code speculatively in this manner normally requires a significant hardware 
overhead. If a particular block should not have been executed then any results it has 
produced must be discarded. This is referred to as "squashing" the execution. In 
particular, any store operations that the code has performed must be undone as they 
could permanently pollute the memory space with incorrect results. CriticalBlue employs 
mechanisms that allow the benefits of speculative execution while only requiring the 
minimum of hardware overhead. 



4-2 Functional Units 

The internal architecture of functional unit is shown below: 




The central core of a functional unit is the execution unit itself. It performs the particular 
operation for the unit. New functional units may be created using user defined execution 
units. The CriticalBlue tools automatically instantiate the required "glue" blocks around the 
execution unit in order to form a functional unit. These glue blocks allows the functional 
unit to connect to other units and to allow the unit to be controlled from the instruction 
word. 



Functional units are placed within a virtual array arrangement. The dashed lines shown on 
the diagram illustrate this. Individual functional units can only communicate with near 
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neighbours within this array. This spatial layout prevents the architectural synthesis 
generating excessively long interconnects between units that would significantly impact 
clock speed. 

The control inputs include the segment of the instruction word that controls that particular 
functional unit. The method selector is passed directly to it. Other fields are used to control 
the multiplexers that select data from buses. Each operand input to the execution unit 
may be chosen from one of a number of potential data buses using a multiplexer. In some 
circumstances the operand may be fixed to a certain bus, removing the requirement for a 
multiplexer. The number of selectable sources and the choice of particular source buses 
are under the control of the CriticalBlue architectural optimisation tools. 

All results from an execution unit are held in independent output registers. These drive 
data on point-to-point buses connected to other fractional units. The point-to-point nature 
of these buses minimises power dissipation and propagation delays. Data is passed from 
one functional unit to another in this manner. The output register holds the same data until 
a new operation is performed on the functional unit that explicitly overwrites the register. 



4.3 Communication Architecture 

Although the CriticalBlue architecture does have a central register file it is treated like any 
other implicit functional unit. All accesses to the register file have to be explicitly scheduled 
as separate operations. Since the register file acts like any other functional unit its 
bandwidth is limited. The code is constructed so that the majority of data values are 
communicated directly between functional units without being written to the register file. 

Traditional architectures have a centralised register file that has customized access ports 
to ail of the functional units. Access to the register file is implicit in the instruction layout 
and semantics of the instruction set. The register file is used to feed the operands of the 
execution units and hold the results generated by them. Unfortunately such a centralised 
register file imposes a significant restriction on scalability. As the level of parallelism in the 
instruction stream increases so does the number of access ports required to a centralised 
register file. These are needed to provide operands to and write back results from all the 
active execution units. The complexity of the register file grows at approximately N 3 where 
N is the number of access ports. The register file soon becomes the bottleneck in the 
design and starts to have a strongly detrimental affect on the maximum clock speed. 

Given the requirement to make CriticalBlue highly scalable, communication of all data 
through a centralised register file is not a viable architectural option. Whenever a 
functional unit generates a result it is held in an output register until explicitly overwritten 
by a subsequent operation issued to the unit. During this time the functional unit to which 
the result is connected may read it. 

A single functional unit may have multiple output registers. Each of these is connected to 
different functional unit or functional unit operand. Since all connection buses are single 
source/sink there is a one-to-one correspondence between output registers and 
connections. The output registers that are overwritten by a new result from a functional 
unit are programmed as part of the instruction, word. This allows the functional unit to be 
utilised even if the value from a particular output register has yet to be used. It would be 
highly inefficient to leave an entire functional unit idle just to preserve the result latched on 
its output. In effect each functional unit has a small, dedicated, output register file 
associated with it to preserve its results. 

An example functional unit array is shown in the diagram below: 
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Given the connectivity limitations of the functional unit array, not every unit is connected to 
every other. Thus in some circumstances a data item may be generated by one unit and 
needs to be transported to another unit with which there is no direct connection. The 
placement of the units and the connections between them is specifically designed to 
minimise the number of occasions on which this occurs. The interconnection network is 
optimised for the data flow that is characteristic of the required application code. 

To allow the transport of such data items, any functional unit may act as a repeater. That 
is it may select one of its operands and simply copy it to its output without any 
modification of the data. Thus a particular value may be transmitted to any operand of a 
particular unit by using functional units in repeater mode. A number of individual "hops" 
between functional units may have to be made to reach a particular destination. 
Moreover, there may be several routes to the same destination. The code generator 
selects the most appropriate route depending upon other operations being performed in 
parallel. 

There are underlying rules that govern how functional units can be connected together. 
Local connections are primarily driven by the predominate data flows between the units. 
Higher level rules ensure that all operands and results in the functional unit array are fully 
reachable. That is, any result can reach any operand via a path through the array using 
units as repeaters. These rules ensure that any code sequence involving the functional 
units can be generated. The performance of the code generated will obviously depend on 
how well the data flows match the general characteristics of the application. Code that 
represents a poor match will require much more use of repeating through the array. 

CriticalBlue uses a truly distributed register file concept. The main register file does not 
form a central bottleneck that would severely hamper the scalability of the architecture. 
The majority of communication in the CriticalBlue architecture occurs directly between 
functional units without a requirement to use the central register file. Small blocks of 
registers local to each functional unit hold data values until they are required by 
subsequent operations. The connections between the functional units and their placement 
are optimised synergistically to adapt to the data flow present in the application. 
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nit operands. Control units are not connected to the execution word bus in this manner 
so do not have decode positions specified. 

The MetaMapper is not able to generate new architectures that have a mixture of 
specified and unspecified decode positions. The execution word optimisation requires that 
all positions are floating. The decode positions are specified in a fixed file since they are 
required for code generation. If the user adding additional units to an already fixed file by 
manual modification of the configuration file then the user must specify suitable decode 
positions. 

The decode position is specified by four different attributes as follows: 
5.4.1 First Opcode Bit 

This specifies the first bit in the execution word that forms part of the operation code bus 
for the unit. The value of this bus is compared against a fixed, allocated value, for each 
unit to determine if it is selected on that clock cycle. 

5A2 Opcode Width 

This specifies the total width of the operation code bus that is distributed to the unit. 

5.4.3 Opcode Value 

This specifies the value required on the operation code bus to cause the unit to be 
selected. Unique operation codes (within each distinct operation code bus) are 
automatically allocated as part of the architecture optimisation process. 

5.4.4 First Control Bit 

This specifies the first bit in the execution word that forms the control bus for the unit. The 
control bus contains the method field, any immediate field, immediate strand and the 
multiplexer settings for each of the input ports to the unit. The width of the control bus is 
determined by the combined width of the method port, immediate port, immediate strand 
and input port selection widths. 

5.5 Method Selector Port 

This specifies the width of the unit method port in bits. A zero length method is port 
acceptable. Control units do not have a method port and set this field to.O. 

In general the method port is used to select which particular method supported by a 
functional unit is to be performed. The selected method will determine which operands 
need valid values and which output results will be generated. Each method has a unique 
method number that is specified in the list of methods for the unit. 



5.6 Immediate Port 

This specifies the width of any immediate port associated with the unit. If an immediate 
port is not required then a value of 0 width can be specified. Control units have this field 
set to 0. 

The immediate port is adjacent to the method port and is held in the same latch for 
presentation to the execution unit at the start of the execute cycle. It is used for specifying 
immediate values to the execution unit. For instance, for immediate units it is used for 
specifying the actual literal value. For other types of unit it may be used to specify bit 
significant values that are used by all methods. 
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5.7 Immediate Strand Flag 

This flag indicates if an immediate strand is specified for a functional unit. All units that can 
perform operations that can permanently change the machine state must have associated 
immediate strands. This allows their operation to be disabled if the associated strand is 
squashed. Units that are only used for performing speculative operations (such as some 
arithmetic units) do not need to specify an immediate strand value. 

Control units have this flag set to false. 

5.8 Controller Glue Units 

A list of possible controller glue units to be used with the unit is provided. If the no 
controller should be used or the unit is a control unit then the list is empty. The glue units 
are specified by the appropriate module names that are instantiated as required by the 
structural HDL generated by MetaBuilder. ■ 

5.9 Delay Pipeline Glue Unit 

This field gives the name of the delay pipeline glue unit that should be used. This delay 
pipeline glue unit is used to delay the strand number for the operation by the same 
number of cycles as the latency of the execution unit. The pipeline unit is selected using a 
module name that is used to instantiate it in the structural HDL generated by MetaBuilder. 
A single pipeline unit is selected to correspond to the latency of the unit. 

No pipeline glue unit is specified for control units. If no pipeline glue unit is required then 
this field is left blank. 

5.10 Run-Out Glue Unit 

This field gives the name of the run-out glue unit that should be used if any. A single run- 
out unit is able to handle all result ports from the functional unit. The run-out should be 
compatible in terms of maximum delay cycles and total width for the functional unit. The 
run-out is able to hold the state of outputs from the functional unit if there is a pipeline stall 
or interrupt The run-out unit is selected using a module name that is used to instantiate it 
in the structural HDL generated by MetaBuilder. 

No run-out glue unit is specified for control units. If no run-out glue unit is required then 
this field is left blank. 

5.11 Input Ports 

This is the list of input ports to the execution unit. The list must include all ports even if 
only some are need to have valid values for particular methods. Method selector and 
immediate ports do not need to be specified. 

5.11.1 Port Name 

This is the name of the port. The specified port name corresponds to the name used in 
the hardware description of the unit. The structural HDL generated by the MetaBuilder 
uses the name. If the port is dedicated then it has a unique hw_ prefix. 



5.11.2 Port Width 

This is the width of the port. Bit widths between 1 and 32 bits are legal. 
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v number of different possible output bank units may be listed, each proceeded by a 
number. The number represents the number of registers the unit supports. There must 
only be one selection for each value. Each unit must be compatible in terms of the data 
width supported. The MetaMapper optimise. selects the appropriate output bank unit for 
as required by the number of operands being fed by it. 

For a fixed configuration file a single output bank unit may is annotated. 

5.13 Latency Attribute 

This indicates the fixed latency of the functional unit in terms of clock cycles. All results 
from the functional unit from all methods must have the same latency. The minimum 
latency is 1 and the simulation environment fixes a maximum latency. If an operation 
actually has a variable latency then the most frequently occurring fixed latency should be 
specified. This may be extended dynamically using the pipeline wait mechanism. 

High latency execution units can lower code density when they are used as the region 
must be at least as long as the highest latency of unit used in the region. Code may have 
to be padded with no-operations while the results are being calculated in the unit. 

5.14 Method List 

A list of the methods available for the unit must be specified. A control unit does not 
require any methods to be listed, as it is implicit that it performs a fixed operation on each 
clock cycle. All functional units must specify one or more methods. If a the unit has a zero 
length method selector field (and therefore only one method) then that single method 
must be listed with a method number of 0. 

Each method may read input ports and write to output ports. A parameter list is specified 
for each method to show which subset of the ports for the unit is nead/written. These 
correspond directly to the parameter list for a software function call. 

The following are the fields that must be specified for each individual method: 
5.14.1 Method Name 

This is the name of the method and corresponds directly to a function identifier. Thus all 
method names must start with the hw_ prefix. All calls to the function of the given name 
are automatically converted into a use of the hardware method. All method names must 
be unique, both within a particular unit definition^ and globally across all unit methods. 
Some method names for implicit functional units are predefined. 

5.142 Method Number 

Gives the number of the method. This is the value that is presented into the method 
selector field to initiate that particular method. Decode logic within the execution unit must 
select the corresponding operation. All method numbers for the unit must be unique and 
small enough to be specified in within the width of the method selector port. 

5.14.3 Return Value 

Gives the name of an output port to be used for a return value. This must be the name of 
an output port for the unit. If the software function for the method returns no result (i.e. it is 
a void function) then this field is left blank. 

5.14.4 Parameters 

Specifies the port names of the parameters to the method. There is a one-to-one 
correspondence between these parameters and the software parameters of the 
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the result should x^rif+o« ? ^ • functlon ca" ' s taken to be the address of where 

SSHSr~s ass 

implementations of such methods 6 *" P ° mter haS n ° Use for hardw ^ 

5.14.5 Speculation Attribute 

muttno* Sllf.'SSi w'SSnSES te ^,^ula^e,y. Sue a meUtod 
method invocation fe ^feraed I A con ? otness <* Program results if the 
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5.14.6 Blockage Attribute 

5.14.7 Writing Dependence Set 
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which reading methods from the same set cannot be moved. The dependence sets are 
global across all functional units so ordering can be maintained between operations 
issued to different units. A particular method may write to any number of dependence 
sets (including none). 

5.14.8 Reading Dependence Set 

This field lists the dependence sets from which the method reads. This restricts 
movement of these methods across writes to the same dependence set. Reads from a 
particular set may be freely scheduled with respect to each other. However, a read from a 
dependence set may not be scheduled over a write to the same set. A particular method 
may read from any number of dependence sets (including none). 

As an example, dependence sets may be used for a floating point unit that includes a 
sticky overflow bit The sticky bit represents internal state that impacts the order in which 
floating point operations may be performed. Assume there is a sticky bit read operation 
that obtains the state of the sticky bit and then clears it. This will be represented as writing 
to the floating point dependence set. Ail the operations (such as add, multiply, subtract 
etc) that may set the sticky bit will be represented as readings of the dependence set. 
This is because the order of the arithmetic operations will not impact the final state of the 
sticky bit (any overflow causes it to be set). However, arithmetic operations must not be 
moved over the clear and read of the bit. 



6 Optimisation Flow 



6.1 Flow Diagram 

The diagram below shows the flow of individual optimisation steps for both architectural 
optimisation and the code generation process: 



Until Maximum Unite/Area 



Architectural Optimisation 



r 



Idealised CDFG 
Unit Allocation 



. For All Code Regions 

c . 

Dataflow Calculation 

Unit Placement 
Neighbou^Connectlons 
Reachability Closure 

Until Maximum Connections 
♦-Execution Word Optimisation 

Code cJneratlon 

•Add New Connection 

Add New Unit 




For All Code Regions 
• Idealised CDFG 

l 

Unit Allocation 
Shadow CDFG 

; 

Shadow Operation Scheduling 

Main CDFG 
I 

Transport Optimisation 

\ 

♦ Main Operation Scheduling 
Code Generation Process 



Once the architecture has been fixed and new code is to be targeted to a processor then 
only the code generation process needs to be performed. As shown the code generation 
process is actually a subset of the full architectural optimisation. 
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During the architectural optimisation an initial architecture is created with minimal 
functional unit and interconnect resources. Additional interconnect is then added to 
improve the performance of the architecture. This is measured by generating code 
schedules on the candidate architecture and estimating their performance. When a 
certain threshold of new connection addition is reached then additional functional units 
may be added to improve parallelism. Whenever new functional units are added the 
interconnects are restored to the initial minimal connectivity. The connectivity optimisation 
process is then repeated. 



6.2 Architecture Optimisation Output 

The output of the architectural optimisation process is a series of candidate architectures 
with particular performance characteristics. The configuration information for each of the 
architectures is stored and the user can then select any individual architecture that most 
closely matches the required area and performance requirements of the application. 

The output of the tool is in the form of a graph representation. An idealised representation 
of the graph is shown in the following diagram: 
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The graph shows the estimated performance versus area requirements for all the 
candidate architectures tested. Each cross on the diagram represents such a candidate 
architecture. A dashed line in the diagram joins all the candidates that are constructed 
from the same set of functional units. Each candidate has a different number of additional 
connections added. 

In general the shape of the curves produced will be similar to those on the diagram. An 
initial architecture can be improved by the addition of new connections but ultimately there 
is a law of diminishing returns as new connections are added. In reality the curves 
produced will not be nearly as smooth as particular idiosyncrasies of particular 
architectural configurations will produce a noise level on the general improvement of the 
architecture as new connections are added. 
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As illustrated on the diagram the disadvantage of architectures with greater numbers of 
functional units .s that the minimal area becomes larger. In general, however the^ereae 
performance should be better presuming that the parallelism afforded b Ihe aSSSS 
functional units can be efficiently utilised. <*raraonai 

J^^iw^f 6d by J he ^? iti0n 0f 016 area of a » *» ««*nal units employed The 
library definition for each unit has an attribute of the amount of area occupied for a 

ffE5 f T *E h u 0,09y - Additi0nal area is also estimated for »» conneS between 

ssrsr used depending upon - target techn °^ «se 

c"Jde?^mlm^Tth nCe l S b f S f ° n 3 Weighted ^^entation of the number of clock 
cyces to implement the critical functions within the application. The number of dork 
cyctes 1S obtained by generating code for the critical functions to mn on ^e ca^dSe 
arch, ec ure. The library blocks and connectivity rules are constructed si That the 
architecture is designed to run at a particular clock frequency for a partteularteraS 

SSS^S^I^^ mu,tip,i f is designed tobe a — S-SJS 

ohhi S . < genarated - If *» user selects a particular candidate architecture then 
EttSSSEX: " aVaHab,e Sh ° Wing Perf0fmanCe ° n individual -tiStnctSns 

6.3 Architectural Optimisation 

« 52!ff?^? aBOn PF f 658 0CCUrS in ^ main P hases - ™ese three phases 
nSSfo r ? * 5 9enerate a range of ^ndidate architectures for the user The 
SS^^SSJS^^ the , m i nimum number of ^tiona. units. This is one of 
each type of functional unit required from the library. The quantity of a particular tvoe of 

ex n ;S'„Te ZSZSST" in "* to ,mprave the amourt - - * 

6.3.1 Quantitive Dataflow Estimation 

The purpose of the quanlatlve dataflow estimation is to generate a table of the data flow 

MM the archileotura. initial?, ihe fu4S 

arohSuX A ^nsStSTSS ^ addl ? on , al V*™ 1 units are to *• <*"«idate 

All the individual steps below are repeated for all the regions in the application oroaram in 
order to provide data for the whole application. The data flow results Irom each region are 
they cSe aCC ° rd,ng * *" ° f the funCtions < and <* 'oopTsthg) Z < 5h5 

6.3.1.1 Idealised CDFG 

^J™ S ? P fe !° PT ^ 0Ce 3n idealised CDFG for re 9 ion - Such a CDFG ignores the 
SXEL ^ 9 T r fi S r6ad and *** ^rations to maintain correct state Sn the 
a ow i tne actual data fl °w between operations to be measured This step 
■s identical to that performed during the standard code generation process. 

6.3.1.2 Unit Allocation 

Once the idealised CDFG has been constructed the particular units to be used for 
operations in the CDFG can be selected. Obviously this is straightforward for Sses ^ere 

spatial analysis is used to select an appropriate candidate. This step is similar to that 
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performed during the standard code generation process. However, since the physical 
placement of the funct.onal units has not been determined at this stage the alloS 
does not make any account of the transport costs associated with a paTcularSm 
The decision .s based purely on maximising theoretical parallelism all °cation. 

6.3.1.3 Dataflow Calculation 

IS T^^^tS^ C hT G 38 ann ° tated the unit a,location s- Each of the 
f u, ?!. S S examined and rt s impact is recorded in a dataflow mat-x Entries aZ 
weighted depend.ng upon the criticality of the function from which the reafon" is oblinpd 

6.3.2 Base Architecture Creation 

?h r l Ce hfl e dat f ™ 'nformabbn has been gathered the base architecture can be created 
The base architecture is the set of functional units included in the architecture o us Jie 
initial connect.ons between them. Further connections are then Idded to thi L«! 
archrtecture to further improve performance. A number o ^c^ndWate base ScNterS 
may be generated as additional functional unit resources are added arch,tectures 

6.3.2.1 Unit Placement 

£fl£ '322$£? m *iT "* * *» «««*~ iS SZ funcS on™ 
™JE,?, , that a rectangular area of allocated units is maintained The exact 

6.3.2.2 Neighbour Connections 

?nn!!^! £*♦ generated between neighbour functional units within the array The 

SSJSSST' ™ aB£ * ed ** ^ 11,6 ** P«*°minate data" S ta£S Te 

6.3.2.3 Reachability Closure 

This step introduces additional connections into the array in order to ensure that all r^rt* 

fransooS^K ThiS means *« a resuIt from any S in 2 

transported to the operand of any other unit within the array. 06 

6.3.3 Connectivity Optimisation 

Il!fh ? U T 0Se ° f 1,16 connectivit y optimisation is to add additional connections to the base 
archrtecture .mprove performance. Each iteration produces a new^atearZlc^ 

count is based on the number of functional units in the array. More connections maJS 
thafl 10 /^ 3 ™ th ™ re ™**- ™s provides new connection ih?n?3S3 to the pC* 

The connectivity optimisation consists of a number of individual steps as described below 
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^\6.3.3.1 Execution Word Optimisation 

Th il Step fi 9e T rateS 016 execution w °rd layout for the processor. The execution word 
width is a fixed input parameter to the optimisation process The execution word S tn hf 
updated after tte addition of each new'connection'has the ^Z TaZ ^neSiSn 
may increase the width .of a selection field for a particular functional unit Zs TTavout 

^^^^s^zr the — * - « ^ 

6.3.3.2 Code Generation 

tn h Lf eP ^ erf0r T S J! 18 00016 9 eneration as described previously. This process is identical 

6.3.3.3 New Connection Addition 

^L2° iCe °l" eW connections ^ driven by the transport optimisation phase of the code 
EES?™* I «qiwrt matrix for the Xle applicafon 

So ?k neCt, ° n Wth the highest wei 9 ht ^at can be added is included frrto the 
Th6re 3re 8 nUmber of frictions on the addition of new c^nnecSons Rrstiv 

connection^rrn^ number and a maximum numbe? of oufotf 

connections from a unit. These are parameterised as part of the library definition so that a 

SSlf l 6 *- 006 ?' ng fret * uenc y <*n be achieved. If a particular unit L less Zino slack 
the the cho.ce of connections is made more limited. Secondly there is a resttSn™ £ 

£X£,r° f % COmeCti0n in thS anay - ° n,y connections between n^hbour an nea!- 
lSXtn are Perm ' tted - ™ 8 ° f M * area « te Parameteris'e^ 

6^3.4 New Unit Addition 

^n^l^ adC !!! i0n iS 1,16 final steD in ** architectural optimisation process It is 
performed when the addition of new connections to a base archSctu?eTas b Ln 

SLTh b ' e £ 1,16 architecture - choice of the unit type is based I on^JEftS 

S be 1 L2E2SL? 3 / 9enerati0n Pr0cess showin 9 *° most frequency used unrts foat 
can be replicated (memory units, register file units, I/O and some other Woes of uniS 
cannot be replicated). The usage count of a unit includes th ^ copTopemttor^ reared £ 
Zl so^h 3 "" OU L°V h f ?* ^ ff P erf ^ance can L^22?S^S?^ 
in i ihe usag^counr t0 06 ^ «■ is W»U 33S 

6.4 Function Weighting 
6.4.1 Philosophy 

The critical function list indicates to the architecture optimiser which functions are kev to 

basisTtL^ 3 " 06 ° f an aPP ' iCation - 7,16 critical list ™y be fomeTon a?ad hoc 
bas.s by the designer or, more commonly, from performance profiling of the application. 

The critical function list allows the optimiser to balance connectivity resources denpndinn 
upon need to individual blocks of code. If a new connecftTgre^ 

^aTXX^° n T MS iS ' ike,y 10 ^ a good'aS^h^vrfo: 
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6.4.2 Format 

A simple format is employed for the critical function list. Each line of the file corresponds to 
a particular function. The line specifies the name of the function and the weight it should 
be given in the optimisation process. The weight is specified as a percentage so values 
between 1 and 100 are acceptable. 

The total of all weights in the file should be 100 or less. If the total weight is less than 100 
then the remaining percentage is allocated equally amongst all other functions not listed in 
the file. 

Functions may only be listed once in the file and the function must be implemented in 
software running on a CriticalBlue processor. Functions that are implemented in hardware 
or in software on the main processor cannot be listed. 



7 Architecture Creation 

7.1 Overview 

The tools must create connections between execution units in response to the data flow in 
the application. The network must be a good fit for the application but still retain sufficient 
connectivity so that the processor is still able to execute genera! purpose code without 
crippling inefficiency. The optimisation problem may be further constrained by total area 
requirements or a limit on the number of inputs or outputs for a single operand 
multiplexer. The goal of the optimisation is to produce an architecture that is minimal, but 
sufficient. The overall utilisation of the connections and execution units in the application 
must be very high. Unnecessary connections just add a cost burden to the processor. 

There is a strong synergy between the code scheduling and architectural synthesis 
optimisations. These two problems should not be considered in isolation. If the 
architectural synthesis produces a good architecture it is of no use if the code scheduler is 
unable to make efficient use of it. Equally, there is no point in producing good schedules 
for a fundamentally bad architecture. The CriticalBlue approach intimately ties these two 
optimisation problems together to avoid these types of mismatches. Indeed, it is the code 
scheduler itself that decides when to add new connections to the architecture. 

The basic approach employed for is to generate trial schedules on candidate 
architectures. An initial trial architecture is produced with the functional units as specified 
by the user. A minimal connectivity network between functional units is also generated. 
This is a minimal architecture that will be optimised by adding customised connections. 

The effectiveness of a particular architecture is measured using attributes of the 
scheduled code. These include the estimated execution time for particular code blocks 
and average utilization of processor resources. The user supplies a list of functions that 
are critical to the overall performance of the application. A weight can be annotated for 
each function to show its contribution to the overall execution time. The functions can be 



19 



' CriticalBlue - J^fe _ 

Arcm^ture Generation 

"identified and proportions determined by usinq orofilina took Fnr a n^rt^... . 
architecture the overall architectural fitness calcuE is S mSe on^ ^ 
generated code efficiency weighted by its importanceto« ° f 

to add new connections. THese connection requests an, driwnaiS SmS MtaEJSS 
Jetton is^cSy 'rnaT a7«?'Sa2T^' S , ¥ de * ar,d 80 «"■ «*» a 

required for their paracular appHcation ° ff area and P«**nanoe as 

7J2 Interconnect Philosophy 

untt are never, or only rarely, used by another unit then there is Sf valu^Sg a 
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^direct connection between them. The number of possible connections grows as a square 
of the number of functional units and each one occupies area on the chip. Also, as the 
number of connections to a particular operand grows the input multiplexer becomes 
slower, lowering the potential clock speed for the system. 

Each functional unit has a number of operand inputs and outputs. The output values are 
held in output registers. Each output drives a single source and single sink bus that is 
connected to an input of one other functional unit. 

7.3 Unit Placement 

The units in the processor are placed in a virtual array. This phase only places the units, 
while connections between them are added during the following phases. 

The placement occurs in a virtual grid with the first unit placed in the middle of the grid. 
The grid can expand in any direction but is forced into a rectangular shape. New units are 
always added at the boundary with the rule that the maximum extend rectangle cannot be 
more than one unit larger in any direction than the enclosed fully utilised rectangle. The 
aspect ratio of the rectangle is a parameter and is used to modify the fitness metric for 
particular placements to try and force the growth into the required shape. 

Placement is driven from an idealized representation of the code. In this form it is 
assumed that all execution units are able to communicate directly. Thus there are no 
transport latencies and the actual scheduling is not performed. A weighted data flow 
matrix is built up as a result of the analysis showing the level of communication between 
particular execution units. Each row/column represents a particular execution unit 
operand or result - if there are multiple copies of a unit then they are represented 
separately. 

Placement always starts with the register file. This is chosen because it is a key unit and 
the amount of communication it requires with other units will be under-represented due to 
the idealised assumptions made during the CDFG optimisations. Thus placing the unit at 
the origin increases the opportunities to add additional connections to it later in the 
connectivity optimisation. The register file is also guaranteed to use 32 bit values, the 
maximum possible. This allows a simpler algorithm to ensure the reachability of data 
values of a certain width throughout the unit array. 

The remaining functional units are placed on the basis of decreasing weighted frequency 
of use. Thus the placement aims to place the most frequently used units near the centre 
of the virtual array and the less frequently used units on the periphery. Some units may be 
marked as being I/O units that must be allocated to the boundary of the virtual array. A 
particular side of the array may be selected. These units are always placed last 
independently of their usage frequency. All units that only have inputs or only have 
outputs are also placed at the edge of the array. These are units are placed on the edge 
because they are unable to act as repeaters. Note that there may be more units that need 
to be placed on the boundary than there are boundary positions. In that case the 
placement does not complete. Additional non-boundary units are replicated until the 
boundary size is sufficiently big. 

The diagram below shows the expansion of the unit placement on the basis of decreasing 
frequency of use. Certain I/O units are placed on particular boundaries as shown: 
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The actual positioning a placement is dependent on the communication requirements of 
the unit Each free slot on the boundary of the virtual array is considered as a candidate. 
The communication fitness is calculated by the total amount of communication to each 
another unit multiplied by the Euclidian distance to the unit in the array. Communication 
with units of the same type is not included in the calculation (thus reducing the chances of 
units of the same type being placed close to one another). Thus new units will tend to be 
placed close to units with which they have high communication. This also means that 
there is a higher chance that a short cut connection could be added during the 
connectivity optimisation stage. Obviously this is an approximation since the actual 
transport cost will depend on the neighbour connections actually created. 

The diagram below illustrates the effect of the unit placement Consider that the unit in the 
op left comer is being placed into the array and that position is a candidate for its final 
placement. The placement is good as the majority of communication (as shown by the 
thickness of lines of the diagram) is with neighbour units. Thus direct connections can be 
created with them avoiding the requirement for transport copies via other units. Other 
placements of the unit on the other side of the array would be significantly less optimal. 
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A further restriction controls the placement of the units. This is the requirement to transmit 
data values without over-truncation to other units in the array. The maximum bit width 
requirement is maintained for all unplaced units. This is the maximum width of any 
operand or result port. There must always be at least one position held free adjacent to a 
unit with an operand and result port the same width or wider than the maximum width of 
unplaced units. If this prevents any placements being made then the highest unplaced 
unit may added to the pool of candidates. This mechanism ensures that it is always 
possible to build routes to transmit data values of the required size to all units in the array. 

7-4 Neighbour Connections 

This phase adds connections between unit neighbours in the virtual array. All horizontal 
and vertical neighbours are linked via north-south and west-east connections. . This 
connectivity is mandatory and serves two main purposes. Firstly, it ensures the units are 
formed into an array that will direct the place and route of the eventual HDL so that it 
reflects the spatial model created by the tool. Secondly, it forms the basis of the minimum 
connectivity network for the processor. All units are able to copy operands so'if a value is 
available at any unit in the array it can be transported to any other. 

The units are visited in the same order that they were placed in the array (the ordering is 
stored). A connection is added for each NESW neighbour of the unit. The number of 
neighbours will vary between 1 and 4 depending upon the unit. The sides are visited in 
order of distance from the central unit so that those connections (i.e. connections to the 
centre of the array) are given priority. If a neighbour connection already exists then and 
additional one is not added. For each unconnected neighbour the most appropriate 
operand or result port is selected. Multiple connections to particular result ports or 
operands are not made unless absolutely required (i.e. if a unit only has one operand and 
result but four neighbours). Neighbourhood links are the width of the ports involved. The 
arrangement of the units ensures that data values of sufficient width can be passed 
around the network. If a result port is wider than the operand port it is connected to then 
superfluous bits are unconnected. If the operand port is wider than the result port then 
either sign or unsigned extension logic is added depending upon the type of the operand. 
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The possible neighbour connections that may be made to a unit are illustrated in the 
diagram below: 

• North Neighbour Connection 
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South Neighbour Connection 

The fitness of the connection is formed on the basis of the weighted sum of 
communications likely to be flowing in that direction based on the data flow matrix. Again 
this is based on the approximation of Euclidian distance to the required unit. 

Once the neighbour connections have been made all adjacent units are connected 
NESW by one connection. The direction of the connection will depend upon the particular 
data flows. However, these connections do not guarantee that a particular data value can 
be transmitted anywhere in the array. The next phase augments the connections to 
ensure that is possible. 



7.5 Reachability Closure 

This phase adds additional neighbour connections as required to ensure that any result 
from any unit can be passed to any operand of any unit. The neighbourhood connections 
added during the previous phase will provide most of the connectivity required to achieve 
this. The levels of data flow direct the connections added during that phase. They do not 
guarantee full reachability. There may be areas of the array that form islands that are not 
connected to other areas. Moreover, there may be operands and result ports that remain 
unconnected. This will definitely be the case if there are units with more operand ports 
and results than neighbours. 

A data structure is created with an entry for each unit in the array. Each operand and 
result port for the unit has an associated flag. For an operand port that flag is set to true if 
data can be passed to the operand port from the first result port of the origin unit (the unit 
in the centre of the virtual array). For a result port that flag is set to true if data from it can 
be passed to the first operand of the origin unit. These flags are used to determine if the 
array is fully reachable - all flags will be set to true in that case. 

The analysis starts with the origin unit and the physical neighbour connections are 
traversed to set the flags appropriately. Flows into the origin are followed backwards for 
the result flags and flow out of the origin are followed for the operand flags. The resulting 
flags represent the initial origin reachability of the units. The reachability analysis 
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maintains the minimum width connection that has been passed through Units with 
operands of greater widths are not considered to be reachable even if connections exist 
since some of the upper bits of a data value will have been lost. 

The reachability analysis does not allow the flow of data through units that already have 
one or more operands considered to be reachable. All operands must be reachable using 
routes not requiring use of other operands to a unit. This ensures that it is possible to 
nrfJ rands J 0 .,? unit one-by-one and hold them in the connecting bus as live until 

L ™ ere « K re may i ,S0 bS 08rtah 0ther °P erands th at are exempt from the 
reachability copy flow because they are required to be live for certain code generation 
s6C|u©nc©s. 

J^Un P *' aCed Uf ? S r wNch a " flags are not set te ada P ted - A new connection is added 
^ t " nreachable P° rtto a neighbour unit that is reachable. If no neighbours are 

^ « ni u'! frie * d u- at ,east one unit must a reachable neighbour. Each 
result flag also has an attribute of the number of bits width that are actually reachable so 
that suitable connections are made. Once the connection is added the analysis is 
extended to determine which units are now reachable. This process continues until all 
ports are reachable. This also ensures that all operand and result ports are connected 

The reachability analysis may include a maximum hop distance (not taking account of unit 

ESS H Pr T£ e a . Ceil I ing t0 rom ™nication times. This could be set to maximum 
length of a side of the virtual array rectangle. 

The diagram below shows an example of reachability closure analysis for a particular 




Operand being 
checked for 
reachability 



Paths used for reachability 

The dotted lines show the paths that are examined to prove the reachability of the 
operand from all possible result ports. In this case the operand is fully reachable. 

7.5.1 Exempt Connections 

A number of connections to particular operands on particular units are exempt from usage 
tor transporting data items when determining reachability closure. This is because these 
operands are required to hold live values across the usage of other operations. The type 
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and extend of these exempt connections is dependent on the nature of the source 
instruction set being translated from. 

The exempt connections are as follows: 

□ Memory Unit Address Operand: The address needs to be ^eld while performing 
a shift in order to implement a subword write. The address must be calculated 
before the shift (as the shift amount is dependent on the address). 

□ Memory Unit Store Operand: The store data needs special storage to allow the 
implementation of a swap function. The data to be stored is held while a 
load/store is performed on the other memory location. 

□ Shifter Amount Operand: The shift amount needs to be held while a load 
operation is performed. A shift may be performed after the load in order to 
implement a subword load. 

7.6 Connectivity Optimisation 

The purpose of the connectivity optimisation is to add additional connections to the base 
virtual processor array to improve performance. This is driven by the transport 
optimisation phase of the code generation. This generates a connectivity request matrix 
for the whole application. The requested connection with the highest weight that can be 
added is included into the system. There are a number of restrictions on the addition of 
new connections. Firstly, there is a maximum input multiplexer selection number and a 
maximum number of output connections from a unit. This may be parameterised 
depending upon the timing slack within the unit itself. Secondly, there is a restriction on 
the distance of a connection in the virtual array. A circle around the input to the unit 
represents this. The diameter of the circle is related to the timing slack of the consuming 
unit. 



8 Architectural Synthesis Example 

8.1 Introduction 

This chapter provides an example of architectural synthesis and code generation. A base 
architecture is generated with the required execution units. The architecture is then 
optimised using a performance critical loop of code. Custom connections are added and 
the impact on the resultant code schedule is shown. The final customized architecture 
along with the code schedule for the performance critical loop is shown at the end of the 
example. 

This example should help with a general understanding of how the process works in 
practice and the function of the individual steps. 

8.2 Source of Program 

The following diagram shows the source code for the code that is going to be used as the 
example: 
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int sample [ 1024] ; // input samples 

irrt output [1024] ; // filtered output samples 

int coeff[16]; // filter coefficients 

int i, *j, *k, sum; // temporary variables 

for (i = 0; i < 1024; i++) { 
sum = 0; 

for (j = &sample[i], k = &coeff[0]; k < scoeff [16] ; j+ + ; k++) { 
sum += (*j) * (*k) ; 

} 

output [i] = sum » 16; // shift to take account of fixed point 

} 

The code shows a 16-tap Finite impulse Response (FIR) filter ooded in the C language. 
This type of digital filter is very common and used in wide range of applications. The 
central loop multiplies a sample with a coefficient and keeps a cumulative total for the 
results with all coefficients. 

The source has been coded to minimise the number of instructions generated in the loop 
body itself. The same efficiency can be obtained with a less optimally coded example by 
using compiler optimisations. The hand optimised version is shown to simplify the task of 
matching the source code against the machine code generated by the compiler. 

The example is compiled using a standard RISC compiler. In this example the source was 
compiled using the ARM software development suite C compiler. 

8.3 Region Extraction 

The following diagram shows the short sequence of code forming the main loop of the 
filter: 

INNER_LGOP : 

LDR r7 , [rl ] , #4 Load a sample from memory and increment pointer 

LDR r8 , [rO ] , #4 Load a coefficient from memory and increment pointer 

CMP r0 , rl4 Check coefficient pointer against the array end - 

the rl4 register is loaded outside the loop 
MLA r2 , r8 , r 7 , r2 Multiply sample and coefficient and add result to sum 

BCC INNERJLOOP Branch back to the start of the loop until all 

sample/coefficient pairs have been processed 

This sequence of ARM instructions will be used as the example code for the rest of the 
chapter. The code simply loads a sample and a coefficient, multiplies them and adds the 
result to a cumulative total held in the R2 register. The code makes use of the ARM 
addressing modes allowing pointers to be automatically incremented. The code also uses 
the ARM ^multiply and accumulate operation that combines the multiply and the addition. 
Comparing one of the pointers against the address after the end of the address range 
performs the loop end detection. 
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The control flow analysis identifies the branch to the INNERJ.OOP has a backward 
branch. This allows the sequence of code to be processed as a loop and, as such, the 
code is allocated to a separate region. 



8.4 Code Translation 

The diagram below shows the translation of the ARM code into a CDFG representing the 
base operations to be performed by a CriticalBIue processor: 
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A dashed line bound the operations generated for each individual instruction. The 
associated instruction is shown alongside. The CDFG is generated directly from the ARM 
code without any other intermediate representation. 

The load instructions generate six independent operations due to the use of the complex 
addressing mode. The base register is read, the access performed and the loaded data 
written to the appropriate destination register. An immediate value of 4 is loaded and 
added to the base register and the base register updated in the register file in order to 
perform the address increment. 

The MLA (multiply and accumulate) is also translated into five individual operations. The 
source registers are read and the multiply performed. The accumulator register is then 
read, the multiply results added to it and then the new total written back to the register. 

The compare instruction is translated into three operations. The two operands are read 
from the register file and a subtraction performed. The same execution unit (arithmetic) is 
used for both addition and subtraction but with a different method selection. Note that in a 
real CriticalBIue processor a special adder only unit is used for addressing addition: In this 
example, the same unit performs the two functions. This is for the sake of simplicity. 

The branch is implemented by a squash and branch operation. The squash operation 
reads the carry output from the arithmetic unit performing the comparison. The squash is 
used to control the execution of the branch operation, which will be placed in a different 
strand. The branch operation does not require an immediate operand as the branch is to 
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the start of the current region. That address can be simply selected by using a particular 
branch method. Commit operations for different strands are not shown in this example for 
the sake of simplicity. 

Note that this example does not show PC load, PC register update or instruction predicate 
check operations. These are normally generated during the translation but deleted from 
the main code. They are not shown in order to simplify the example. 

The translated code will produce the same register results on an ARM instruction 
granularity basis. 

8.5 Transport Optimisation 

Once the CDFG for the loop has been built it is optimised in order to remove unnecessary 
operations. The optimisations applied to the example CDFG are shown below: 




The optimisations are focused on improving the transport of particular data items between 
operations. This is done by eliminating register reads and writes so that data items can 
flow directly between operations without go through the register file, wherever possible. 

Three applications of register bypassing are applied. The deleted operations and data 
flows arcs are shown dotted. New data flow arcs are shown that obtain the data from an 
earlier producer. 

Firstly, the write and read of register R7 (one of the parameters to the multiply) can be 
optimised away as R7 is not live at the end of the loop. Thus data can flow directly from 
the memory unit to the multiply unit. Secondly, the other operand in R8 to the multiply is 
similarly optimised. Thirdly, the read of RO to obtain the current address for comparison 
can be optimised away. In this case the corresponding write of RO is retained as the 
register is live as a loop carried register. 

8.6 Minimum Set of Execution Units 

The following diagram shows the set of execution units that are required to execute the 
main loop of the filter: 
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Immediate t> 



> Reg File |> 



<^ Memory £>• 



> 



* 



> 



^ Arithmetic |> 
^> Squash 



^> Branch 



In a complete implementation some other execution units are required (such as a logical 
unit) to enable the translation of all instructions in the ARM instruction set. However, for 
the sake of brevity only those units required for the example are shown. 

A commit unit is normally required to mark the transition of strands from their speculative 
to committed execution phases. Again, for the sake of brevity, this is not shown. 



8.7 Unallocated Virtual Array 

The" architectural synthesis must allocate the functional units into an array configuration. 
Before the allocation commences there is effectively an infinite virtual array into which the 
units may be allocated. This is illustrated in the diagram below: 
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Virtual positions to 
be filled with 
functional units 



Array Origin - 
Register file unit 
" always placed at this 
position 



The origin position at the centre of the array is always allocated with the register file unit. 
This unit is always placed centrally because it is likely to require the high levels of 
communication with the other functional units. Placing it centrally minimises its average 
distance to all other functional units in the system. 

As additional functional units are allocated to the array they are placed adjacently to 
existing units. The placement ensures that a largely rectangular block of units is 
generated. The aspect ratio of the allocated blocks can be controlled by the user as an 
input parameter. 

8.8 Data Flow Matrix 

The unit placement in the virtual array seeks to place units that communicate frequently 
close to one another. This means that there is a higher likelihood that they can be 
connected using direct physical connections. Neighbour units can be allocated direct 
connections. If the units have to be communicate indirectly then if they are physically 
close then that minimises the number of intermediate steps required to communicate 
operands. This reduces the latency of operations. 

To direct the unit placement process a matrix is built up representing the data flow within 
the application. An example matrix is shown below: 



31 



CriticalBlue. 



Architecture Generation 





Arithmetic 


Immediate 


Memory 


Multiply 


Reg File 


Arithmetic (Opl) 


o 


0 


1 

•J* 


1 




Arithmetic (Op2) 


0 


2 


1 


o 


1 ! 


Branch 


o 


0 


o 


o 


0 


Memory 


0 


0 


0 


0 


2 


Multiply (Opl) 


0 


0 


0 


0 


0 


Multiply (Op2) 


0 


0 


0 


0 


0 


Reg File 


4 


0 


0 


0 


0 


Squash 


1 


0 


0 


0 


0 



The rows of the matrix represent the input operands of each of the functional units in the 
system. The columns represent the output ports of the functional units. The values in the 
matrix indicate the number of transfers that are made from a particular output port to a 
particular input operand. Thus the quantity of data flow between them is represented. The 
amount of data flow is also weighted by the relative priority of the functions in which the 
communications occur. Thus important loop kernels have a much higher impact on the 
values in the matrix than infrequently executed control code. 

The values in the matrix are obtained from an idealised representation of the CDFG. In 
that representation all possible transport optimisations are performed, independently of 
whether that could lead to a potential deadlock situation if real code was scheduled from 
it. The matrix values represent the transfer of data between actual operations rather than 
with the register file, in so far as possible. 

8.9 Unit Placement 

Once the data flow matrix has been calculated the next step is to form the placements of 
the individual functional units. The units are placed so that units with high levels of inter- 
communication are placed as close to one another as is possible. An example is shown in 
the diagram below: 



32 



CriticalBlue 




Archmbture Generation 
Placement ordering 



Thus units that are used very Sequent I S 0rder * the,r fre ^ncy of usage, 
allows improved communicator,^? ^ o^eNmoSSt uhS'^h^ ° f the ™ s 
frequently used units or those S have bTen w»h * > 2 h 83 the register file - Les s 
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bal^ It 8 ' p,a t cement of »» uni te and the choice of initial neighbour connections is 
8.1 1 Reachability Analysis 
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any possible sequence. Obviously most of the paths will be hiqhlv indirect r^nnM™ ^ 
data operands to be copied via functional units a rtmtJTtoS^^M 
dest|nat.on is reached. Thus the efficiency of the code is highly dependent upon the exS 
funcfonal unrts chosen and the actual data flows required The ptaSE^^S 
connectrons created between them is focused on making those paths effident 

The additional connections that need to be added for the reachability closure in the 
example are shown on the diagram below: «w iaD miy closure in the 



Input operand previously unconnected 




Additional 
connection makes 
second operand of 
arithmetic unit 
reachable from 
register file (via 
Mem and *) 



The first operand of the multiplier unit was not previously connected to anv result nnrt and 

SJSf 7£ reaChab ' e fr ° m anyWhere - A new ^nnection is added to connect ft to tKe 
outpu of the memory unit like the other operand to the multiplier. The se^d ooerand to 

*" PreVi ° U ? y ° n,y fed fr0m * e immediat " unit stiS S rtK 

SsES^S-ass ssswss 

?!^.? e . '?f haDillt >' anfysls is complete the airay of functional units is fullv connected 

rnSrs^ss 00,3165 *° pass a - *™ 

8.12 Data Flow Labelling 

o^e'Cm JET" " CDFG ** 810 teCy * 819 * ™» te 



3S 




of 3 ^rtE ? T 6 8 ' ate " Cy ° f 1l except for multi P'y has a latency 
So fi w J'" 19 f tUally occurs durin 9 the construction of the CDFG as the latencies 
are fixed propert.es of the execution units required by each operation The labeHnq fe 
shown here as a separate step for the purposes of the example. 9 

n^e?^h^CDFr ^l^"'" 9 / 18 n ? e CritiC8,ity ana,ysis t0 determine "«* critical 
nodes .n the CDFG These are the nodes that are in critical paths of data flow and whose 

placement in the schedule is likely to have the most impact on overall execuS time 
8.1 3 Node Criticality Analysis 

°w ? 6 5 C labe,ing is completed, node criticality analysis is performed Possible results 
of this for the example are shown on the diagram below: rossioie results 
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SEL/E* 5 IT arked ^ 3 number indicat| ng criticality. Lower numbers are more 
2f? M, hl ? her n T berS - ^ P^fonnhg scheduling, nodes with lower n*mbe* Ta re 



The numbers are derived from the node free float calculated as part of the critical oath 

Sntf ^n," iS / measure of how much *e Placement 5 node h the schedu e 
can be vaned wrthout affecting the overall execution time of the region Highly critical 
nodes on the critical data flow path have a free float of 0. V 

8.14 Mapping to Architecture 

The diagram below illustrates the scheduling problem of mapping the CDFG to the 
architecture of the processor. A small segment of tine CDFG is shown: ' 
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At clock N the multiply operation is mapped onto the multiply unit. As the multiply has a 
atency of 3 clock cycles the next operation cannot be, initiated until clock cycle N+3 At 
this point the add operation is performed on the arithmetic unit. Since data can be fed 
directly from the multiplier to the arithmetic unit no intermediate copy operations are 
required and the two calculations can be executed back-to-back. Finally at clock cycle 
N+4 the register write is performed on the register file unit. Again the units are directly 
connected so no additional copy operations are required. 

8.15 Unit-Time Representation 

The diagram below shows the initial Unit-Time representation for scheduling on the 
example architecture: w 
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On each clock cycle there is an entry for each functional unit in the processor The table is 
used to show a graphical representation of how those units are used over time The 
purpose of the scheduling process is to minimise the overall schedule length (ie total 
number of clock cycles to execute a given block of code) by making as much' use of 
parallelism between functional units as possible 



38 



CriticalBlue Architecture Generation 

I 

8.16 Connectivity Optimisation 

If the filter loop code is marked as being critical then it is used to customize the 
connectivity of the final architecture. Additional connections are added that correspond to 
data flows in the loop. Addition of these connections improves the performance that can 
be obtained. Connections are added one by one. That is, a single connection is added 
and then code is scheduled again to measure the connections impact on the schedule 
generated and determine what other connections could be added. 

The first two custom connections that may be added by the example code are shown in 
the diagram below: 
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Firstly, a direct connection is made between the register file output and the first operand of 
the arithmetic unit. This allows the address registers to be read directly into the arithmetic 
unit in order for the addresses to be updated. A second direct connection is added 
between the output of the arithmetic unit back into its second operand. This allows 
cumulative results to be calculated without having to perform copy operations through 
other functional units. In effect this forms the accumulate section of a multiply-accumulate 
unit. The connections required for a multiply-accumulate unit emerge automatically in the 
architecture just from analysis of the filter loop dataflow. 

8.17 Final Schedule 

The impact of the custom connections on the schedule are shown in the unit-time 
diagram below: 
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The legends above the blocks of functional units show the set of operations being 
executed on each dock cycle. As illustrated a number of operations are completed on 
each clock cycle. The configuration of connections between the execution units (as 
configured by the measured data flows) ensures that the majority of data flows are directly 
from one unit to the next. In most cases the data arrives on exactly the clock cycle that ft 
Snmnnl?^ S ° 2 rt J s " sed immediately. One exception is the second operand to the 
ESffiL*"!? dafe V S aVai ' able ° ne Cl0ck c y c,e eariier *» it needs to be 
i« ^rnn^ h m n P l° Perat,0n . Cann0t Start until »» second °P erand is mailable. The data 
LtlTL^K ' n ^ 9 J P u Pr0pn f. te 0Utput ^ster from the memory unit until it is required. 

SfmS.v^n?h U I ? I? """ft I TOt rea , d f ° r cyc,es after •» °P eration as 
tne multiply unit has a latency of that many clock cycles. 

The shaded operation executed on the register file indicates a copy operation. The copy 
direcfly connected P8SS8d arithmeti ° Unit t0 ** SqUaSh Unit aS the8e are no * 

IS.^?' 8 * 6 '^ 16 , Unit 1 acc f ssed on evef y si n9le clock cycle. If more parallelism was 
available in the loop then the average number of register file accesses would tend to 
reduce, thus relieving pressure on that unit. 

The customized data paths allow the entire loop to fit into 8 clock cycles. Higher averaqe 
performance is possible with a loop containing more parallelism. However, such a loop 
would be too complex for a worked example. 

tf ]?, issue °/.? e sc l uash °n the last cycle of the schedule would not be legal on a 
h!!i nt iu a, u IUe a u rch,tecture - as ft wuld have to precede the commit for the strand 
holding the branch operation. These commit operations are not shown in this example 



8.18 Summary 

This example has illustrated how a specialised architecture is automatically generated 
using the data flow in example code to direct the connectivity. The architecture is 
optimised for execution of a FIR loop. However, many of the connections that have been 
added to the architecture will be commonly used by other, unrelated, code and provide 
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^Significant performance benefits. A direct connection is automatically Generated hefwroi 
the output of the arithmetic unit and one of its inputs to 3S^SSKi5?t2 
anthmetic unit is also placed next to the muKiply unit so thatresufe SnTwdirSy 2o 
op»n° P eff6Ct ' 3 ^ mu,ti P'y- acc ^^te unit has emeS t 



The example used consists of only five host instruction* Thorof^ fk« . * 

a^Th 3 ^ fe minimaK Wrth 3 ^^S^ ?S2£ 
and thus there is more scope to use many more of the functional units simultaneoudy 

More parallelism can be obtained from the FIR filter by unrolling the loop by hand so that 
each iteration performs a number of multiply-additions. The CDFG optimised 3f£ 
able to el.minate more register file accesses as a higher proportion will be to rSe^ ttS 
are not live outside of the loop and are not loop carried. Wit , fuH opSsation sel?cfed toe 
compiler may also perform loop unrolling. numisaoon selected, tne 

The MetaMapper is itself capable of performing loop unrolling. This was not shown for this 
example for the sake of brevity. However, the example loop would bTunro led°o ?ncrerse 
^™ ber o[ operations in the loop and thus potential parallelism. MeteMappJ^ ?S5S 
J are ,s a P° ten « al of an exit on each original iteration by nTppto nSe 
SEE? - Strands in the same reaion - On oe unrolled a multiply could I be 

itemffrom ° th f er CyC L 6 - The P erforman oe bottleneck is the need to obSi^ro da5 
items from memory for each calculation. The code could be rewritten to model^ arSw I 
second memory unit. The operands can then be obtained frw^e^^ unfit 
and allow a performance of one multiply-accumulate per cycle to be aS^d VWth 
further memory units added the performance can be grown even further 

9 Structural HDL Creation 

9.1 Synthesis Methodology 

A structural HDL description of the processor is generated from the description file This 
HDL descnpfton instantiates all the functional units in the array and provkte rnodelted 
connectMty between them. The HDL indicates the layout ^pSITm 
the place and route stages are directed such that the spatial mode used durina 
architectural creation correlates with the actual layout. 9 

9=2 Testability 

L h p e po^e^ tested and debugged using the simuiation framework 

supported by CrticalBlue. However, this is only a software simulation usino the 
behavioural models supplied by the user. It does not directly test th^e hSwaS of 2e 

both onor £VJ™ SS °- t har ?T re .r dUCed * me Critica,B ' ue mustTtestabte 
both pnor to the commitment to silicon and in a manufacturing flow There are hvo 
essential aspects to this testing. They are unit level and system level testing 

naL U ? il If V t! l f S ? n ?- each 0f 016 individual oo"^ 01 and functional blocks in the processor 
need to be tested. Test benches with appropriate stimulus and ex'pected Lponsefmus 

devSoSv StS 0 " 83 H nd V8Ct0rB VW "" 56 SUpplied for toSl^XSZTat 
developed by CnticalBlue and are present in each processor. 

SSn d^Snf^S'Z*? ^ th 1 en -i re Pr ° CeSSOr indudin 9 a » the interconnections 
vJri^nf fi f , and funct,onal un,ts within it. The interconnections between the 

vanous functional un.ts are generated automatically so should not be incorrect Before 
producing silicon, hardware simulations need to be performed to verify toa? ? toe Sse 
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or the manufacturing test flow they do need to be tested since they might be damaqed 
by a silicon defect. a 

The CriticalBlue tools produce the entire processor as a soft core in RTL Existinq RTL 
simulation tools can be used to simulate the entire system running application code 
produced by the CriticalBlue compilers. As the design is progressed through logic 
synthesis and the place and route stages, further simulations are performed at the qate 
£? h • P ?k l f on J de " ce in ^ desi 9n- The application code used would normally 
include test benches for the individual execution units in order to verify their functionality. 

To assist with manufacturing test, the CriticalBlue tools are able to automatically generate 
test vectors to verify the interconnection network between the functional units. The vectors 
also test the logic generated automatically for each functional unit. These vectors are 

CritictlB?ueSs teStin9 ° f ^ dSSign 0131 produced automatically by the 

The test vectors use the debug chain is used to inject particular data patterns into the 
output registers of the functional units. The vectors test all the data paths in the desian 
and the operation of the multiplexers. A special test bypass is provided to route data from" 
the input operands of a functional unit to its output without using the execution unit This 
allows the functional unit connectivity to be tested in isolation from the execution unit 
functionality. The test vectors are different for every design because the connectivity is 
different. Complete coverage of all the automatically generated parts of the design is 
guaranteed This a lows the engineer to concentrate on verification and testing efforts on 
the parts of the design that they have produced. 

The testing of the individual execution units is the responsibility of the user, as it would be 
JLT nar ;dware design block. The CriticalBlue tools cannot generate vectors to 
test them since it doesn't know their internal structure. The engineer must produce a 

S h i and f Pected ° Utputs for exe ^«°n "n't- usually generated from a 
test bench These are designed to test the full functionality and gates/connectivity within 
the execution unit. Although the CriticalBlue tools cannot actually Test the execution mtts 
they prov.de support by generating test vectors that present the correct stimulus and 
reads the outputs from a particular execution unit. 

9.3 Dedicated Input Ports 

9.3.1 RSN Port (hw RSN) 

The RSN port may be routed to particular execution units. The Result Strand Number (RSN) 
shows the strand number of any new operation being initiated- on the functional unit. The RSN 
is obtained from an immediate value specified for an operation. 

For instance, memory units containing an SWB use hw_RSN to determine which strand a 
particular memory operation is associated with. This allows a speculative write to be committed 
or discarded as required depending upon the final status of strands in the region. 

The simulation environment provides hw_RSN has a global integer variable The 
behavioural model of an execution unit may use this. It is set to the appropriate value 
before calling the behavioural model of each execution unit. 

9.3.2 SEM Port (hw_SEM) 

J?«« E u P ° rt be routed t0 P articular execution units. The Strand Enable Mask 
(SEM) shows which strands are currently being executed. There is one bit for each 
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ssible strand in the processor. If a particular strand bit is reset then that indicates that 
the strand is disabled and no operations from the strand should be initiated. 

For instance, memory units containing an SWB use hwjSEM to determine which entries to 
discard at the end of the region. Entries in the SWB from enabled strands are committed 
while those from disabled strands are discarded. 

The simulation environment provides hwjSEM has a global integer variable. The 
behavioural models of execution units may read it. 

9-3.3 ERF Flag (hwjERF) 

The End Region Flag (ERF) may be routed to particular execution units. The ERF is 
a single bit from the execution word that indicates that the word is the last in the 
current region. ERF is used by execution units that need to perform particular actions 
at the end of a region. 

For instance, memory units containing a SWB use hw_ERF. The assertion of the end 
flag indicates that speculative writes have become committed and that writes from 
squashed strands should be purged. 

The simulation environment provides hw_ERF has a global Boolean variable that may be 
read by behavioural models of execution units. 



9 A Dedicated Output Ports 
9A1 Wait Flag (hw_wait_x) 

This is a dedicated wait flag that may be asserted by any execution unit. If asserted then 
the entire pipeline of the processor is stalled. A number of individual wait ports are 
available to allow several functional units to have a wait capability. The x value gives the 
wait signal number. 

The wait must be asserted very early in the last execution cycle of the functional unit. It 
must be available early as it is combined with all other wait signals and routed to all 
functional units. When wait is asserted the clocks for all functional units except those 
asserting wait is gated off. Thus the same state is maintained in the other functional units. 
If clock gating is not available then the assertion of the wait signal causes register 
feedback in each of the pipeline stages of the functional unit. Thus the pipeline is not 
advanced while wait is asserted. 

The wait flag is used for functional units that may occasionally have an extended latency. 
The frequent shorter latency is specified as the fixed latency of the unit and then the wait 
flag is used to extend the latency when required. Possible uses of this mechanism are as 
follows: 



□ Implementation of a cache. The wait is asserted when there is a cache miss. 
While the processor pipeline is stalled the cache unit can access main memory to 
obtain the required cache line. A minimum latency of two clock cycles is generally 
required to allow sufficient time for a tag lookup, comparison and then assertion of 
the wait signal early in the second cycle of execution. A one cycle latency cache 
could be implemented but at the expense of an elongated cycle time. 

□ Implementation of denormalised floating point operations. Denormalised values (if 
implemented) do not occur frequently but usually require an extra cycle of 
processing for floating point operations. Assertion of the wait line allows an extra 
cycle to be inserted. . 
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□ If a Speculative Write Buffer (SWB) is being used on memory units then the wait 
signal can be used to delay the pipeline if the buffer becomes f^The pipeHnels 
stalled wh.le entnes are written to memory and free positions become available 

Note that if ihe wait signal is used then the maximum interrupt latency (for both fast and 
slow rnterrupts) is extended by the maximum possible wait period. 

9.4.2 Squash Vector (hw squash x) 

A squash vector is generated by a squash operation and used to reset hit* in tho 
Predrcate Mask (SPM). A number of individual squash vecl* ar ^sulorteS Torter to 
aHow the use of multiple squash units. This allows parallel squash operations t iS 
b^Sie Sft ffi* "* iS 3 P6rf — b ~ ^ numbe^Tsquas? 

corresponding strand and prevents any results from Z steTtenTcom^* M 
squashes for a strand must be resolved before it enterc its comSed phase 

The branch control unit generates one squash vector. This allows it to saua^h »n ^nrto 
h.gher than a strand that is performing a" branch. These late 

S^iattSaS if an ear,ier branch is ™ e squas * ssrs 

An individual squash operation is able to generate several squash flags The souash flaas 
foto stnd, 9en h e - ati0n Pr ° CesS and a,,ow M fl ° w constats to be Slapsed 

squash Tunrt SqU3Sh ' S PrediCated ° n 3 particular c^ 0 " 3 ' value supplied to the 

H5£ tt?3t ? e ^ SqUasl ? Unit must ensure ^ 016 bit s in the squash vector are reset for 
cycles in which no valid squash operation was issued. 

9.4.3 Abort Bus (hw__abort_x) 

ttettmL^ ha2ard ° r guard °P eration and ««« to reset bits in 

to JX Mask SAM). A number of individual abort buses are supported h ader 

^JS^wT^J^ check rf ha2a * This allows parallel checktoS* 
. P ndSby^ !v2x ' S 3 Perf ° rmanCe ta * ni * ^ number of the ab °* bus is 

Each abort bus specifies a particular strand number with a separate valid bit to indicate if 
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i^^imulation purposes the variable hw_akort_x is declared as a global integer. This 
may be written with the appropriate abort strand by the behavioural model of a unit. The 
simulation kernel reads these variables in order to update the simulated SAM value 
showing which strands should be aborted. 

9.4.4 Commit Bus (hw_coinmit_x) 

A commit bus is generated by a commit unit and used to updated the Committed Strand 
Mask (CSM). This in turn enables the selection of more possible branch destinations from 
the branch control unit. A number of individual commit buses are supported in order to 
allow the use of multiple commit units. The number of the commit bus is indicate by the 
value x. 

Each commit bus specifies a particular strand number with a separate valid bit to indicate 
if the strand number is valid. This causes the bit in the CSM corresponding to the strand 
to be set if that strand is currently enabled. This allows any branch issued by the strand to 
be selected if no previous branch has been issued. If a breakpoint has set on the strand 
then this causes the selection of the breakpoint branch. 

Note that the commit unit must ensure that the valid bit in the commit bus is reset for 
cycles in which no valid operation was issued. 

For simulation purposes the variable hw_commit_x is declared as a global integer. This 
may be written with the appropriate commit strand by the behavioural model of a unit. The 
simulation kernel reads these variables in order to update the simulated CSM value 
showing which strands have been committed. 
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